26 research outputs found
Low-Rank Softmax Can Have Unargmaxable Classes in Theory but Rarely in Practice
Classifiers in natural language processing (NLP) often have a large number of output classes. For example, neural language models (LMs) and machine translation (MT) models both predict tokens from a vocabulary of thousands. The Softmax output layer of these models typically receives as input a dense feature representation, which has much lower dimensionality than the output. In theory, the result is that some words may be impossible to predict via argmax, irrespective of input features, and empirically, there is evidence that this happens in small language models (Demeter et al., 2020). In this paper we ask whether it can happen in practical large language models and translation models. To do so, we develop algorithms to detect such unargmaxable tokens in public models. We find that 13 out of 150 models do indeed have such tokens; however, they are very infrequent and unlikely to impact model quality. We release our algorithms and code to the public.
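As a rough illustration of the idea (not the detection algorithms the paper releases), the sketch below builds a toy bias-free output layer with 2-dimensional features and 5 classes, then samples input directions by brute force to see which classes ever win the argmax. The weight values, the function name `argmax_classes`, and the sampling approach are all assumptions made for this example. Because softmax is monotone, comparing logits is enough, and without a bias term only the direction of the input matters.

```python
import math

def argmax_classes(weights, num_angles=3600):
    """Return the class indices that win the argmax for at least one
    sampled input direction (2-D features, no bias term)."""
    winners = set()
    for k in range(num_angles):
        theta = 2 * math.pi * k / num_angles
        x = (math.cos(theta), math.sin(theta))
        # Logit of each class is the dot product of its weight vector with x.
        logits = [w[0] * x[0] + w[1] * x[1] for w in weights]
        winners.add(max(range(len(logits)), key=logits.__getitem__))
    return winners

# Hypothetical weights: the last vector lies strictly inside the convex
# hull of the other four, so some other class always has a larger logit.
WEIGHTS = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (0.2, 0.1)]

reachable = argmax_classes(WEIGHTS)
unreached = sorted(set(range(len(WEIGHTS))) - reachable)
print("classes never selected by argmax:", unreached)
```

Sampling can only suggest that a class is unargmaxable; an exact check in higher dimensions needs a different method, which is outside the scope of this toy example.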
Fast machine translation on parallel and massively parallel hardware
Parallel systems have been widely adopted in the field of machine translation, because
the raw computational power they offer is well suited to this computationally intensive
task. However, programming for parallel hardware is not trivial, as it requires redesign
of the existing algorithms. In my thesis I design efficient algorithms for machine translation
on parallel hardware. I identify memory accesses as the biggest bottleneck to
processing speed and propose novel algorithms that minimize them. I present three distinct
case studies in which minimizing memory access substantially improves speed:
Starting with statistical machine translation, I design a phrase table that makes decoding
ten times faster on a multi-threaded CPU. Next, I design a GPU-based n-gram
language model that is twice as fast per £ as a highly optimized CPU implementation.
Turning to neural machine translation, I design new stochastic gradient descent techniques
that make end-to-end training twice as fast. The work in this thesis has been
incorporated in two popular machine translation toolkits: Moses and Marian.
Character Mapping and Ad-hoc Adaptation: Edinburgh's IWSLT 2020 Open Domain Translation System
This paper describes the University of Edinburgh’s neural machine translation systems submitted to the IWSLT 2020 open domain Japanese↔Chinese translation task. On top of commonplace techniques like tokenisation and corpus cleaning, we explore character mapping and unsupervised decoding-time adaptation. Our techniques focus on leveraging the provided data, and we show the positive impact of each technique through the gradual improvement of BLEU.
An Open Dataset and Model for Language Identification
Language identification (LID) is a fundamental step in many natural language
processing pipelines. However, current LID systems are far from perfect,
particularly on lower-resource languages. We present a LID model which achieves
a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201
languages, outperforming previous work. We achieve this by training on a
curated dataset of monolingual data, the reliability of which we ensure by
auditing a sample from each source and each language manually. We make both the
model and the dataset available to the research community. Finally, we carry
out detailed analysis into our model's performance, both in comparison to
existing open models and by language class.
Comment: To be published in ACL 202
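For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how a macro-averaged F1 score and a false positive rate can be computed for a multi-class classifier. It is not the authors' evaluation code: the function name `macro_f1_and_fpr`, the toy labels, and the choice to average the false positive rate over classes are assumptions made for illustration.

```python
def macro_f1_and_fpr(gold, pred):
    """Return (macro-averaged F1, macro-averaged false positive rate)
    over the set of labels that appear in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    total = len(gold)
    f1_scores, fprs = [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        tn = total - tp - fp - fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_denominator = 2 * tp + fp + fn
        f1_scores.append(2 * tp / f1_denominator if f1_denominator else 0.0)
        # FPR = FP / (FP + TN); defined as 0 when there are no negatives.
        fprs.append(fp / (fp + tn) if (fp + tn) else 0.0)
    return sum(f1_scores) / len(labels), sum(fprs) / len(labels)

# Toy example with three hypothetical language labels.
gold = ["en", "en", "fr", "fr", "de", "de"]
pred = ["en", "fr", "fr", "fr", "de", "en"]
macro_f1, macro_fpr = macro_f1_and_fpr(gold, pred)
print(round(macro_f1, 3), round(macro_fpr, 3))
```

Macro-averaging gives every language equal weight regardless of how many test sentences it has, which is why it is a common choice when lower-resource languages matter.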
In Neural Machine Translation, What Does Transfer Learning Transfer?
Transfer learning improves quality for low-resource machine translation, but it is unclear what exactly it transfers. We perform several ablation studies that limit information transfer, then measure the quality impact across three language pairs to gain a black-box understanding of transfer learning. Word embeddings play an important role in transfer learning, particularly if they are properly aligned. Although transfer learning can be performed without embeddings, results are sub-optimal. In contrast, transferring only the embeddings but nothing else yields catastrophic results. We then investigate diagonal alignments with auto-encoders over real languages and randomly generated sequences, finding that even randomly generated sequences as parents yield noticeable but smaller gains. Finally, transfer learning can eliminate the need for a warm-up phase when training transformer models in high-resource language pairs.